Encoding of speech in convolutional layers and the brain stem based on language experience

Comparing artificial neural networks with outputs of neuroimaging techniques has recently seen substantial advances in (computer) vision and text-based language models. Here, we propose a framework to compare biological and artificial neural computations of spoken language representations and propose several new challenges to this paradigm. The proposed technique is based on a principle similar to the one that underlies electroencephalography (EEG): averaging of neural (artificial or biological) activity across neurons in the time domain. It allows comparison of the encoding of any acoustic property in the brain and in intermediate convolutional layers of an artificial neural network. Our approach allows a direct comparison of responses to a phonetic property in the brain and in deep neural networks that requires no linear transformations between the signals. We argue that the brain stem response (cABR) and the response in intermediate convolutional layers to the exact same stimulus are highly similar without applying any transformations, and we quantify this observation. The proposed technique not only reveals similarities, but also allows for analysis of the encoding of actual acoustic properties in the two signals: we compare peak latency (i) in cABR relative to the stimulus in the brain stem and (ii) in intermediate convolutional layers relative to the input/output in deep convolutional networks. We also examine and compare the effect of prior language exposure on the peak latency in cABR and in intermediate convolutional layers. Substantial similarities in peak latency encoding between the human brain and intermediate convolutional networks emerge based on results from eight trained networks (including a replication experiment). The proposed technique can be used to compare encoding between the human brain and intermediate convolutional layers for any acoustic property and for other neuroimaging techniques.
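The averaging principle and the peak latency measure described above can be illustrated with a minimal sketch. It is not the analysis pipeline used in the paper: the function names, the toy activation values, the sampling rate, and the choice to locate the peak on the absolute value of the averaged signal are assumptions made for illustration only.

```python
def average_over_channels(feature_map):
    """Average a [channels][time] activation map across channels.

    Returns one value per time step, analogous to pooling activity
    across many units into a single time-domain signal.
    """
    n_channels = len(feature_map)
    n_steps = len(feature_map[0])
    return [sum(ch[t] for ch in feature_map) / n_channels for t in range(n_steps)]


def peak_latency(signal, sample_rate_hz, onset_index=0):
    """Time in ms from onset_index to the largest absolute value at or after it."""
    peak_index = max(range(onset_index, len(signal)), key=lambda i: abs(signal[i]))
    return 1000.0 * (peak_index - onset_index) / sample_rate_hz


# Toy layer with 3 channels and 6 time steps (made-up values, not real data).
toy_layer = [
    [0.0, 0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.6, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.3, 0.0, 0.0],
]

averaged = average_over_channels(toy_layer)
# Hypothetical sampling rate of 1000 Hz, so one time step equals 1 ms.
latency_ms = peak_latency(averaged, sample_rate_hz=1000)
```

The same two functions could be applied to a channels-by-time EEG recording and to a convolutional feature map, which is what makes the two signals comparable without a learned transformation between them.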


Model Architecture
Dimensionality of WaveGAN 2, the deep neural network examined for this experiment. Given a 100-dimensional random vector z as input, the Generator first applies a fully connected layer and reshapes the output into a 2D matrix, then applies a series of transposed convolutional layers to upsample it into a 16384-length vector corresponding to audio. The Discriminator has a nearly mirrored architecture: it passes a 16384-length vector through a series of standard convolutions to downsample the input, then flattens the result and passes it through a fully connected layer to produce the final Discriminator score.

Table S2. Estimates of a generalized additive mixed model (fitted with the bam() function in the mgcv package by 3). Amplitude in µV from the EEG-cABR data is the dependent variable. The independent variables include LANGUAGE as a parametric predictor (two levels, English and Spanish, with English treatment-coded as the reference level), a smooth for Time, a by-LANGUAGE difference smooth for Time, and by-subject random smooths. The model includes a correction for autocorrelation. The parametric coefficient LANGUAGE = Spanish estimates the overall difference in cABR signal values between English and Spanish (expectedly non-significant, as there is no reason to expect amplitude or base level to differ between the groups). The smooth term s(Time):LANGUAGE = Spanish tests the overall significance of the difference smooth between English and Spanish and thus reveals whether the cABR trajectories (smooths) of the two groups differ significantly. The actual difference smooth is plotted in Figure 1j.

Table S4. Estimates of the linear model described in Section 4.2 (main paper) with three predictors (LANGUAGE, nTH PERIOD, and REPLICATION with all two-way and three-way interactions) for the Discriminator network.
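To make the mirrored up- and downsampling concrete, the sketch below traces only the time-axis length through each network. The starting length of 16, the factor of 4 per layer, and the count of five layers are assumptions taken from the commonly described WaveGAN configuration with a 16384-sample (2^14) output; they are not stated in the caption above, and channel counts are omitted entirely.

```python
def generator_lengths(start_length=16, upsample_factor=4, n_layers=5):
    """Time-axis length after the reshape and after each transposed convolution.

    All default values are assumptions for illustration, not taken from the text.
    """
    lengths = [start_length]
    for _ in range(n_layers):
        lengths.append(lengths[-1] * upsample_factor)
    return lengths


def discriminator_lengths(input_length=16384, downsample_factor=4, n_layers=5):
    """Time-axis length of the audio input and after each strided convolution."""
    lengths = [input_length]
    for _ in range(n_layers):
        lengths.append(lengths[-1] // downsample_factor)
    return lengths


gen = generator_lengths()
disc = discriminator_lengths()

for layer, length in enumerate(gen):
    print(f"Generator stage {layer}: length {length}")
for layer, length in enumerate(disc):
    print(f"Discriminator stage {layer}: length {length}")
```

Under these assumptions the Discriminator lengths are the Generator lengths in reverse order, which is one way to read the "mirror architecture" described in the caption.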

Generator
In the Generator network, peak latency differs significantly in periods 2, 4, 5, and 7 in the first replication and in periods 1 and 10 in the second replication (unadjusted). After FDR adjustment, periods 4 and 5 remain significant in the first replication and period 1 in the second replication (Table S5).

Table S5. Pairwise contrasts in peak timing difference between English and Spanish across replications in the Generator network with FDR adjustment (computed with the emmeans package by 4). The burst is marked by the 0th period. The 12th period is not estimated due to lack of data.
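The difference between unadjusted and FDR-adjusted significance reported for these contrasts can be illustrated with a small sketch of the Benjamini-Hochberg procedure. It assumes that the FDR adjustment referred to here is Benjamini-Hochberg; the function name and the p-values are made up for illustration and are not the values from Table S5.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four period-wise contrasts.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)

alpha = 0.05
significant_raw = [p < alpha for p in raw]
significant_adj = [p < alpha for p in adj]
```

In this toy example three contrasts fall below 0.05 before adjustment and only one does afterwards, mirroring how fewer periods remain significant once the adjustment is applied.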

Discriminator
In the Discriminator, periods 4 and 6 are significant in the first replication and periods 3, 4, 5, and 11 in the second replication (unadjusted). After FDR adjustment, period 6 remains significant in the first replication and period 3 in the second replication (Table S6).

Table S10. Estimates of the linear model described in Section 4.2 (main paper) with three predictors (LANGUAGE, nTH PERIOD, and REPLICATION with all two-way and three-way interactions) for the Discriminator network if the waveforms are not converted to absolute values.

Generator
In the Generator network, peak latency differs significantly in periods 1, 3, 4, and 6 in the first replication and in periods 6 and 10 in the second replication (unadjusted). After FDR adjustment, periods 1 and 6 remain significant in the first replication and period 6 in the second replication (Table S11).